AITopics | convolutional vision transformer

convolutional vision transformer

ECViT: Efficient Convolutional Vision Transformer with Local-Attention and Multi-scale Stages

Qian, Zhoujie

arXiv.org Artificial Intelligence, Apr-22-2025

Vision Transformers (ViTs) have revolutionized computer vision by leveraging self-attention to model long-range dependencies. To address these limitations, we propose the Efficient Convolutional Vision Transformer (ECViT), a hybrid architecture that effectively combines the strengths of CNNs and Transformers. ECViT introduces inductive biases such as locality and translation invariance, inherent to Convolutional Neural Networks (CNNs), into the Transformer framework by extracting patches from low-level features and enhancing the encoder with convolutional operations. Additionally, it incorporates local attention and a pyramid structure to enable efficient multi-scale feature extraction and representation. Experimental results demonstrate that ECViT achieves an optimal balance between performance and efficiency, outperforming state-of-the-art models on various image classification tasks while maintaining low computational and storage requirements. ECViT offers an ideal solution for applications that prioritize high efficiency without compromising performance. Transformers use self-attention [1] to model long-range dependencies, revolutionizing how models handle sequential data. The Vision Transformer (ViT) [2] treats images as sequences of patches and uses self-attention to capture global dependencies, marking a successful transition from natural language processing (NLP) to computer vision (CV).
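The two ideas named in this abstract, treating an image as a sequence of patches and restricting attention to a local neighborhood, can be illustrated with a minimal pure-Python sketch. This is not the ECViT implementation: the function names `image_to_patches` and `attention` are hypothetical, the query, key, and value projections are replaced by the identity, and the "local" window is a block of consecutive tokens in sequence order rather than a 2D spatial window.

```python
import math


def image_to_patches(image, patch):
    """Split an H x W image (list of rows) into non-overlapping
    patch x patch tiles and flatten each tile into one token."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[top + i][left + j]
                           for i in range(patch) for j in range(patch)])
    return tokens


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(tokens, window=None):
    """Single-head self-attention with identity projections (assumption:
    query = key = value = token). With window=None every token attends
    to all tokens; otherwise only to the tokens in its own block of
    `window` consecutive tokens."""
    dim = len(tokens[0])
    out = []
    for i, query in enumerate(tokens):
        if window is None:
            idx = list(range(len(tokens)))
        else:
            start = (i // window) * window
            idx = list(range(start, min(start + window, len(tokens))))
        scores = [sum(a * b for a, b in zip(query, tokens[j])) / math.sqrt(dim)
                  for j in idx]
        weights = softmax(scores)
        out.append([sum(wt * tokens[j][c] for wt, j in zip(weights, idx))
                    for c in range(dim)])
    return out


# A 4 x 4 toy image with pixel values 0..15, cut into four 2 x 2 patches.
image = [[4 * r + c for c in range(4)] for r in range(4)]
tokens = image_to_patches(image, 2)
global_out = attention(tokens)           # 4 x 4 = 16 score computations
local_out = attention(tokens, window=2)  # 4 x 2 = 8 score computations
```

With `n` tokens, global attention computes `n * n` scores while the windowed variant computes `n * window`, which is the kind of saving local attention is meant to provide.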

artificial intelligence, machine learning, transformer, (15 more...)

2504.14825

Genre:

Research Report > New Finding (0.66)
Research Report > Promising Solution (0.48)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

A Convolutional Vision Transformer for Semantic Segmentation of Side-Scan Sonar Data

Rajani, Hayat, Gracias, Nuno, Garcia, Rafael

arXiv.org Artificial Intelligence, Feb-23-2023

Distinguishing among different marine benthic habitat characteristics is of key importance in a wide set of seabed operations ranging from installations of oil rigs to laying networks of cables and monitoring the impact of humans on marine ecosystems. The Side-Scan Sonar (SSS) is a widely used imaging sensor in this regard. It produces high-resolution seafloor maps by logging the intensities of sound waves reflected back from the seafloor. In this work, we leverage these acoustic intensity maps to produce pixel-wise categorization of different seafloor types. We propose a novel architecture adapted from the Vision Transformer (ViT) in an encoder-decoder framework. Further, in doing so, the applicability of ViTs is evaluated on smaller datasets. To overcome the lack of CNN-like inductive biases, thereby making ViTs more conducive to applications in low data regimes, we propose a novel feature extraction module to replace the Multi-layer Perceptron (MLP) block within transformer layers and a novel module to extract multiscale patch embeddings. A lightweight decoder is also proposed to complement this design in order to further boost multiscale feature extraction. With the modified architecture, we achieve state-of-the-art results and also meet real-time computational requirements. We make our code available at https://github.com/hayatrajani/s3seg-vit

artificial intelligence, convolutional vision transformer, machine learning, (2 more...)

doi: 10.1016/j.oceaneng.2023.115647

2302.12416

Genre: Research Report (0.40)

Industry: Energy > Oil & Gas > Upstream (0.89)

Technology:

Information Technology > Artificial Intelligence > Vision (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.53)

Vision Transformers or Convolutional Neural Networks? Both!

#artificialintelligence, Aug-4-2021, 14:05:19 GMT

The field of Computer Vision has for years been dominated by Convolutional Neural Networks (CNNs). Through the use of filters, these networks are able to generate simplified versions of the input image by creating feature maps that highlight the most relevant parts. These features are then used by a multi-layer perceptron to perform the desired classification. Recently, however, the field has been transformed by the Vision Transformer (ViT) architecture, which through the mechanism of self-attention has obtained excellent results on many tasks.
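How a filter turns an image into a feature map can be shown with a small pure-Python sketch. The helper name `conv2d_valid`, the toy image, and the hand-picked edge kernel are all illustrative assumptions; in a real CNN the kernel values are learned, and libraries implement this operation far more efficiently.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding (stride 1) and return
    the feature map of dot products. As in most deep learning libraries,
    the kernel is not flipped, so this is technically cross-correlation."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k_rows) for j in range(k_cols))
             for c in range(out_cols)]
            for r in range(out_rows)]


# Toy 4 x 4 image: dark left half (0), bright right half (1).
image = [[0, 0, 1, 1] for _ in range(4)]

# Hand-picked 2 x 2 filter that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = conv2d_valid(image, kernel)
# feature_map == [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The feature map is zero wherever the image is uniform and non-zero only along the column where the brightness changes, which is the sense in which a feature map "highlights the most relevant parts" of the input.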

architecture, transformer, vision transformer, (15 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.55)

Vision Transformers or Convolutional Neural Networks? Both!

#artificialintelligence, Jul-23-2021, 21:35:39 GMT

Through the use of filters, Convolutional Neural Networks (CNNs) are able to generate simplified versions of the input image by creating feature maps that highlight the most relevant parts. These features are then used by a multi-layer perceptron to perform the desired classification. Recently, however, the field has been transformed by the Vision Transformer (ViT) architecture, which through the mechanism of self-attention has obtained excellent results on many tasks. This article takes some basic aspects of Vision Transformers for granted; if you want to go deeper into the subject, I suggest you read my previous overview of the architecture. Although Transformers have proven to be excellent replacements for CNNs, one important constraint makes their application rather challenging: the need for large datasets.

architecture, transformer, vision transformer, (15 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)
